Languages and Compilers for Parallel Computing

chapter

CUDA-Lite: Reducing GPU Programming Complexity

Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi, Wen-mei W. Hwu

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 1-15

The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct...

chapter

MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs

John A. Stratton, Sam S. Stone, Wen-mei W. Hwu

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 16-30

CUDA is a data parallel programming model that supports several key abstractions - thread blocks, hierarchical memory and barrier synchronization - for writing applications. This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared memory, multi-core CPUs. Our framework consists of a set...

chapter

Automatic Pre-Fetch and Modulo Scheduling Transformations for the Cell BE Architecture

Nikola Vujić, Marc Gonzàlez, Xavier Martorell, Eduard Ayguadé

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 31-46

Ease of programming is one of the main impediments for the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid...

chapter

Efficient Set Sharing Using ZBDDs

Mario Méndez-Lojo, Ondřej Lhoták, Manuel V. Hermenegildo

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 47-63

Set sharing is an abstract domain in which each concrete object is represented by the set of local variables from which it might be reachable. It is a useful abstraction to detect parallelism opportunities, since it contains definite information about which variables do not share in memory, i.e., about when the memory regions reachable from those variables are disjoint. Set sharing is a more precise...

chapter

Register Bank Assignment for Spatially Partitioned Processors

Behnam Robatmili, Katherine Coons, Doug Burger, Kathryn S. McKinley

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 64-79

Demand for instruction level parallelism calls for increasing register bandwidth without increasing the number of register ports. Emerging architectures address this need by partitioning registers into multiple distributed banks, which offers a technology scalable substrate but a challenging compilation target. This paper introduces a register allocator for spatially partitioned architectures. The...

chapter

Smashing: Folding Space to Tile through Time

Nissa Osheim, Michelle Mills Strout, Dave Rostron, Sanjay Rajopadhye

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 80-93

Partial differential equation solvers spend most of their computation time performing nearest neighbor (stencil) computations on grids that model spatial domains. Tiling is an effective performance optimization for improving the data locality and enabling coarse-grain parallelization for such computations. However, when the domains are periodic, tiling through time is not directly applicable due to...

chapter

Identification of Heap–Carried Data Dependence Via Explicit Store Heap Models

Mark Marron, Darko Stefanovic, Deepak Kapur, Manuel Hermenegildo

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 94-108

Dependence information between program values is extensively used in many program optimization techniques. The ability to identify statements, calls and loop iterations that do not depend on each other enables many transformations which increase the instruction and thread-level parallelism in a program. When program variables contain complex data structures including arrays, records, and recursive...

chapter

On the Scalability of an Automatically Parallelized Irregular Application

Martin Burtscher, Milind Kulkarni, Dimitrios Prountzos, Keshav Pingali

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 109-123

Irregular applications, i.e., programs that manipulate pointer-based data structures such as graphs and trees, constitute a challenging target for parallelization because the amount of parallelism is input dependent and changes dynamically. Traditional dependence analysis techniques are too conservative to expose this parallelism. Even manual parallelization is difficult, time consuming, and error...

chapter

Statistically Analyzing Execution Variance for Soft Real-Time Applications

Tushar Kumar, Romain Cledat, Jaswanth Sreeram, Santosh Pande

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 124-140

Certain high-performance applications like multimedia and gaming have performance requirements beyond reducing program execution time. These applications have repetitive components whose desired performance characteristics are more naturally expressed using soft real-time theory with its probabilistic guarantees. However, for large complex gaming and multimedia applications, programmers typically...

chapter

Minimum Lock Assignment: A Method for Exploiting Concurrency among Critical Sections

Yuan Zhang, Vugranam C. Sreedhar, Weirong Zhu, Vivek Sarkar, et al.

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 141-155

In this paper we propose a lock assignment technique to simplify the mutual exclusion enforcement in multithreaded programs. Programmers are allowed to annotate the regions of code that are expected to be mutually exclusive as critical sections, without using explicit locks. The compiler then automatically infers an assignment of the minimum number of locks to critical sections by solving the Minimum...

chapter

Set-Congruence Dynamic Analysis for Thread-Level Speculation (TLS)

Cosmin E. Oancea, Alan Mycroft

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 156-171

The move to multi-core has increased interest in parallelizing sequential programs. Classical dependency-based techniques, although successful for some classes of programs, often fail due to the one-sided (conservative) approximation of program behavior. Thread-level speculation enables increased parallelism by allowing out-of-order execution: correct dependences are ensured by run-time monitoring...

chapter

Thread Safety through Partitions and Effect Agreements

Nicholas D. Matsakis, Thomas R. Gross

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 172-186

This paper describes a safety analysis for a multithreaded system based upon transactional memory. The analysis guarantees that shared data is always read and written from within a transaction, while allowing for unsynchronized access to thread-local and (shared) read-only data, as well as the migration of data between threads. The analysis is based on a type and effect system for object-oriented...

chapter

P-Ray: A Software Suite for Multi-core Architecture Characterization

Alexandre X. Duchateau, Albert Sidelnik, María Jesús Garzarán, David Padua

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 187-201

The increasing complexity of computer architectures has made the approach of automatically generating code that is optimized for the target machine a growing area of interest. Examples of such systems are library generators, such as ATLAS, SPIRAL, and FFTW. To generate optimized code without manual intervention, these systems need to know the values of certain hardware parameters, such as the cache...

chapter

Scalable Implementation of Efficient Locality Approximation

Xipeng Shen, Jonathan Shaw

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 202-216

As the memory hierarchy becomes deeper and is shared by more processors, locality increasingly determines system performance. As a rigorous and precise locality model, reuse distance has been used in program optimizations, performance prediction, memory disambiguation, and locality phase prediction. However, the high cost of measurement has severely impeded its use in scenarios requiring high efficiency,...

chapter

P-OPT: Program-Directed Optimal Cache Management

Xiaoming Gu, Tongxin Bai, Yaoqing Gao, Chengliang Zhang, et al.

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 217-231

As the amount of on-chip cache increases as a result of Moore’s law, cache utilization is increasingly important as the number of processor cores multiplies and the contention for memory bandwidth becomes more severe. Optimal cache management requires knowing the future access sequence and being able to communicate this information to hardware. The paper addresses the communication problem with two...

chapter

Compiler-Driven Dependence Profiling to Guide Program Parallelization

Peng Wu, Arun Kejariwal, Călin Caşcaval

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 232-248

As hardware systems move toward multicore and multithreaded architectures, programmers increasingly rely on automated tools to help with both the parallelization of legacy codes and effective exploitation of all available hardware resources. Thread-level speculation (TLS) has been proposed as a technique to parallelize the execution of serial codes or serial sections of parallel codes. One of the...

chapter

gluepy: A Simple Distributed Python Programming Framework for Complex Grid Environments

Ken Hironaka, Hideo Saito, Kei Takahashi, Kenjiro Taura

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 249-263

Problem-solving frameworks in large-scale and wide-area environments must handle connectivity issues (NATs and firewalls), maintain scalability with respect to connection management, accommodate dynamic processes joining/leaving at runtime, and provide simple means to tolerate communication/node failures. All of the above must be presented in a simple and flexible programming model. This paper designs...

chapter

A Fully Parallel LISP2 Compactor with Preservation of the Sliding Properties

Xiao-Feng Li, Ligang Wang, Chen Yang

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 264-278

Compacting garbage collectors (GCs) are widely used due to their good properties of in-place collection and heap de-fragmentation. In addition, they support fast bump-pointer allocation and provide good access locality. Most known commercial JVM or CLR implementations use a compaction algorithm in certain garbage collection scenarios, such as in full heap or mature object space collections. LISP2 compactor...

chapter

A Case Study in Tightly Coupled Multi-paradigm Parallel Programming

Sayantan Chakravorty, Aaron Becker, Terry Wilmarth, Laxmikant Kalé

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 279-291

Programming paradigms are designed to express algorithms elegantly and efficiently. There are many parallel programming paradigms, each suited to a certain class of problems. Selecting the best parallel programming paradigm for a problem minimizes programming effort and maximizes performance. Given the increasing complexity of parallel applications, no one paradigm may be suitable for all components...

chapter

ASYNC Loop Constructs for Relaxed Synchronization

Russell Meyers, Zhiyuan Li

Lecture Notes in Computer Science > Languages and Compilers for Parallel Computing > 292-303

Conventional iterative solvers for partial differential equations impose strict data dependencies between each solution point and its neighbors. When implemented in OpenMP, they repeatedly execute barrier synchronization in each iterative step to ensure that data dependencies are strictly satisfied. We propose new parallel annotations to support an asynchronous computation model for iterative solvers...

Languages and Compilers for Parallel Computing
21st International Workshop, LCPC 2008, Edmonton, Canada, July 31 - August 2, 2008, Revised Selected Papers
